Source code

Github project page.

Data

From ~2.5 million spectra, we aim to detect Lyman-break galaxies (LBGs) by comparing against a known template. In particular, good candidates should:

  1. Match key absorption troughs (nearly) exactly.
  2. Not exhibit signs of broad higher-wavelength emissions (likely to be quasars).

The first requirement is easier than the second, since we are provided a high-quality LBG template to compare with. If a similarly high-quality template were to be provided for quasars, it should be possible to substantially improve these results.

Considerations

The method started simple but grew more complex over time with testing.

The main consideration in designing the method was robustness. Every step in the process should be as robust as possible against:

  1. Noise
  2. Arbitrary vertical & horizontal shifting
  3. Arbitrary vertical scaling

Carefully selected smoothing methods were effective at reducing anticipated noise, convolution matching of absorption troughs was effective at correcting for horizontal shifts, and regressing on quantiles was effective at correcting vertical shifting/scaling. More details are provided below.

Method

The process was carefully constructed and tuned with the help of several R Shiny debugging apps I made (see repo for details). Below is a more in-depth overview:

  1. The spectrum is first trimmed at both ends to remove Gibb’s artifacts.
  2. Then, it’s gently low-pass smoothed by truncating its (fast) Fourier transform.
  3. Next, to emphasize absorption troughs, we took the difference between the spectrum and an aggressively loess-smoothed version of itself.
  4. This was then convolved with a similarly trough-emphasized version of the template (cB58), which was used in two ways:
    1. First, the index of the global maximum was recorded as the red/blue-shift correction (this works because the wavelength is already in log-scale).
    2. Then, the convolved result is transformed to reduce noise and emphasize potential matches. The normalized area under the index found above, called \(A_{\text{peak}}\), is calculated and saved.
  5. Now, we slide our template to correct for red-shift and rescale by using simple linear regression (SLR) to match fixed quantiles in the first two-thirds of both spectra.
    1. Note SLR is merely used to match points that be vertically shifted/scaled (thus, the quantiles don’t even need to be linear).
    2. The first two-thirds were used since it kind of appears (this needs better investigation!) that quasars may have more broad emissions to the side, so matching part of the spectrum may help distinguish these cases.
  6. Penultimately, we take the Kolmogorov-Smirnov difference statistic between the entirety of both spectra, and transform it into \(K_{\text{trans}}\) to penalize (up to a point) spectra with higher statistic (i.e. larger difference) than a noisy version of our template.
  7. Lastly, we simply take the product \(C=A_{\text{peak}}\cdot K_{\text{trans}}\) as our composite score. Note both \(A_{\text{peak}},K_{\text{trans}}\in[0,1]\) with higher being better.

Steps 5-6 above are kind of a hack. If we had a high-quality quasar template to match against, we may be able to replace it with more elegant code that also achieves higher specificity (this is only a conjecture).

Results

A table of top 500 results sorted by ranked by composite score can be viewed here.

Plots of some of the best matches are shown below. Most info in the title/label are for my debugging purposes. Notably, the offset index (red-shift correction) and \(A_{\text{peak}}\), \(K_{\text{trans}}\), and \(C\) are shown in the title. The template was shifted down from the spectra for legibility.

For each spectrum, the composite score is also shown on the side for convenience. Initially, I was going to give a (likely very poor) guess for each one what type of astronomical body it may be, but I ultimately decided against it since I really have no idea 🙃

Here are the top 50 or so spectra (with a few visually ruled out). There seem to be quite a few potential quasars in here with broad emission bands CIV and CIII (some of which are quite visible).

   
Score:
0.998173
Score:
0.996457
Score:
0.996200
Score:
0.993031
Score:
0.990401
Score:
0.990346
Score:
0.989884
Score:
0.988500
Score:
0.987187
Score:
0.986188
Score:
0.985287
Score:
0.984728
Score:
0.983821
Score:
0.983571
Score:
0.983414
Score:
0.982770
Score:
0.982456
Score:
0.981456
Score:
0.981149
Score:
0.978160
Score:
0.975860
Score:
0.974586
Score:
0.973305
Score:
0.972581
Score:
0.971594
Score:
0.971582
Score:
0.969833
Score:
0.969629
Score:
0.969139
Score:
0.969034
Score:
0.966476
Score:
0.964608
Score:
0.962382
Score:
0.960429
Score:
0.959924
Score:
0.959898
Score:
0.958947
Score:
0.958364
Score:
0.957592
Score:
0.955886
Score:
0.954140
Score:
0.953548
Score:
0.953100
Score:
0.952959
Score:
0.952637
Score:
0.951155
Score:
0.950995
Score:
0.949865
Score:
0.949348
Score:
0.948499